Skip to content

Conversation

@Kunchd
Copy link
Contributor

@Kunchd Kunchd commented Oct 9, 2025

Why are these changes needed?

As discovered in the PR to better define the interface for reference counter, plasma store provider and memory store both share thin dependencies on reference counter that can be refactored out. This will reduce entanglement in our code base and improve maintainability.

The main logic changes are located in

  • src/ray/core_worker/store_provider/plasma_store_provider.cc, where reference counter related logic is refactor into core worker
  • src/ray/core_worker/core_worker.cc, where factored out reference counter logic is resolved
  • src/ray/core_worker/store_provider/memory_store/memory_store.cc, where logic related to reference counter has either been removed due to the fact that it is tech debt or refactored into caller functions.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

}

Status CoreWorkerPlasmaStoreProvider::Get(
const absl::flat_hash_set<ObjectID> &object_ids,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic below lifts the owner address resolution logic into the caller of Get.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a behavior change here. Before, we'd fetch the owner addresses for each batch. Now, we'll get them all up-front in the caller. It's probably fine. Keep an eye on the perf numbers to make sure there isn't a regression tho

}
}

absl::flat_hash_map<ObjectID, rpc::Address> CoreWorker::GetObjectIdToOwnerAddressMap(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function represents the owner address resolution logic lifted out of plasma store.

for (auto &get_request : get_requests) {
get_request->Set(object_id, object_entry);
// If ref counting is enabled, override the removal behaviour.
if (get_request->ShouldRemoveObjects() && ref_counter_ == nullptr) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove after get flag is getting removed since it stems from the dark ages when reference counting wasn't a thing.

const ray::RayObject &object, const ObjectID &object_id)> object_allocator)
: io_context_(io_context),
ref_counter_(counter),
reference_counting_(reference_counting),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All reference counter null checks for whether reference counting is enabled is lifted into the caller.

void CoreWorkerMemoryStore::Put(const RayObject &object, const ObjectID &object_id) {
void CoreWorkerMemoryStore::Put(const RayObject &object,
const ObjectID &object_id,
const bool has_reference) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Memory store only uses reference counter for checking if it has reference to an object or if reference counting is enabled. Both is lifted into caller.

@Kunchd Kunchd marked this pull request as ready for review October 9, 2025 19:59
@Kunchd Kunchd requested review from a team, SongGuyang, kfstorm and raulchen as code owners October 9, 2025 19:59
@Kunchd Kunchd requested a review from israbbani October 9, 2025 19:59
@Kunchd Kunchd added the go add ONLY when ready to merge, run all tests label Oct 9, 2025
explicit DefaultCoreWorkerMemoryStoreWithThread(
std::unique_ptr<InstrumentedIOContextWithThread> io_context)
: CoreWorkerMemoryStore(io_context->GetIoService()),
: CoreWorkerMemoryStore(io_context->GetIoService(), /*reference_counting=*/false),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
: CoreWorkerMemoryStore(io_context->GetIoService(), /*reference_counting=*/false),
: CoreWorkerMemoryStore(io_context->GetIoService(), /*reference_counting=*/true),

There's no reason for the default core worker memory store to be created with reference counting disabled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not used in any local mode tests, then you should set it to true.

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 10, 2025
Copy link
Contributor

@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haven't looked at this too hard yet, but at first glance it looks like having address in that vector will make wait/get perf on large lists even worse than it already is bc the impact of the vector set copying madness will be even worse

Do we need address to be in there??

@Kunchd
Copy link
Contributor Author

Kunchd commented Oct 10, 2025

haven't looked at this too hard yet, but at first glance it looks like having address in that vector will make wait/get perf on large lists even worse than it already is bc the impact of the vector set copying madness will be even worse

Do we need address to be in there??

Nope, definitely doesn't need to be in the same data structure. I've moved it out for performance.

cursor[bot]

This comment was marked as outdated.

@Kunchd Kunchd requested a review from dayshah October 14, 2025 17:56
@jjyao
Copy link
Collaborator

jjyao commented Oct 21, 2025

Needs to rebase

@Kunchd Kunchd requested review from a team, pcmoritz and thomasdesr as code owners October 22, 2025 18:29
@Kunchd Kunchd requested review from a team, aslonnie, edoakes, jjyao and richardliaw as code owners October 22, 2025 18:29
explicit DefaultCoreWorkerMemoryStoreWithThread(
std::unique_ptr<InstrumentedIOContextWithThread> io_context)
: CoreWorkerMemoryStore(io_context->GetIoService()),
: CoreWorkerMemoryStore(io_context->GetIoService(), /*reference_counting=*/false),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not used in any local mode tests, then you should set it to true.


Status CoreWorkerPlasmaStoreProvider::GetObjectsFromPlasmaStore(
absl::flat_hash_set<ObjectID> &remaining,
absl::flat_hash_map<ObjectID, int64_t> &remaining_object_id_to_idx,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a performance optimization. Since we now take in both the owner id and address in Get() instead of resolving owner address dynamically, we need a way to associate owner_id with its owner address quickly.

}

Status CoreWorkerPlasmaStoreProvider::Get(
const absl::flat_hash_set<ObjectID> &object_ids,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a behavior change here. Before, we'd fetch the owner addresses for each batch. Now, we'll get them all up-front in the caller. It's probably fine. Keep an eye on the perf numbers to make sure there isn't a regression tho

std::vector<ObjectID> batch_ids;

int64_t num_total_objects = static_cast<int64_t>(object_ids.size());
absl::flat_hash_map<ObjectID, int64_t> remaining_object_id_to_idx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the index for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +3093 to +3094
std::vector<ObjectID> object_ids = {return_id};
auto owner_addresses = reference_counter_->GetOwnerAddresses(object_ids);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above about overloading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +1396 to +1398
std::vector<ObjectID> object_ids =
std::vector<ObjectID>(plasma_object_ids.begin(), plasma_object_ids.end());
auto owner_addresses = reference_counter_->GetOwnerAddresses(object_ids);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not overload GetOwnerAddresses instead of creating a vector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, GetOwnerAddresses associate the output owner address vector with the input object_ids by index. If we take in a set instead, the ordering is now ill-defined. The current change should not cause any performance changes here as we simply moved the object_id vectorization out of plasma_store_provider and into core_worker.cc https://github.com/ray-project/ray/blob/master/src/ray/core_worker/store_provider/plasma_store_provider.cc#L262.

From offline discussion, we've concluded that the current way of storing owner address as a vector is poor cardinality as most object ids likely share the same owner address. However, the change to reduce cardinality will require significant logic changes across multiple files. GetOwnerAddress will be overloaded in a future PR that addresses the cardinality issue.

hdrs = ["store_provider/plasma_store_provider.h"],
deps = [
":common",
":core_worker_context",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay

// Prepare object ids vector and owner addresses vector
std::vector<ObjectID> object_ids =
std::vector<ObjectID>(plasma_object_ids.begin(), plasma_object_ids.end());
auto owner_addresses = reference_counter_->GetOwnerAddresses(object_ids);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Avoid Unnecessary Vector Copies

Creating a vector from a set and then calling GetOwnerAddresses() is inefficient. The PR discussion mentions "Why not overload GetOwnerAddresses instead of creating a vector?" This creates an unnecessary copy of potentially many ObjectIDs. The same issue appears at lines 3093-3094 and 3365-3367, but those involve smaller vectors so the performance impact is less severe.

Fix in Cursor Fix in Web

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants